Automated generation of phonemic lexicon for voice activated cockpit management systems

ABSTRACT

A system, method and program for acquiring from an input text a character string set and generating the pronunciation thereof which should be recognized as a word is disclosed. The system selects from an input text plural candidate character strings, which are phonemic character candidates or allophones to be recognized as a word; generates plural pronunciation candidates for each selected candidate character string and outputs the optimum pronunciation candidate to be recognized as a word; generates a phonemic dictionary by combining data in which the pronunciation candidates with optimal recognition are respectively associated with the character strings; generates recognition data in which character strings respectively indicating plural words contained in the input speech are associated with pronunciations; and outputs a combination contained in the recognition data, out of combinations each consisting of one of the candidate character strings and one of the pronunciation candidates with the optimum recognition.

CROSS-REFERENCE TO RELATED APPLICATION

The present disclosure is a continuation of U.S. patent application Ser. No. 14/498,897, filed on Sep. 26, 2014, which was issued a Notice of Allowance on May 13, 2015, the entirety of which is herein incorporated by reference. U.S. patent application Ser. No. 14/498,897 claims priority to U.S. Provisional Patent Application Ser. No. 61/907,429, filed on Feb. 7, 2014, the entirety of which is herein incorporated by reference.

REFERENCES SEARCHED AND CITED

US Patent Documents

8,768,704 B1 July 2014 Fructuoso et al.
8,706,680 B1 April 2014 Macfarlane
6,125,341 A September 2000 Raud et al.
6,745,165 B2 June 2004 Lewis
7,010,490 B2 March 2006 Brocious
4,725,956 February 1998 Jenkins
5,926,790 A July 1999 Wright
6,044,322 A March 2000 Stietler
6,173,192 B1 January 2001 Clark
6,285,926 B1 September 2001 Weiler et al.
6,512,527 B1 January 2003 Barber
6,529,706 B1 March 2003 Mitchell
6,567,395 B1 May 2003 Miller
6,704,553 B1 March 2004 Eubanks
6,720,890 B1 April 2004 Ezroni
6,832,152 B1 December 2004 Bull
7,606,327 B2 October 2009 Walker et al.
7,606,715 B1 October 2009 Krenz
5,745,649 A April 1998 Lubensky
6,018,708 A January 2000 Dahan et al.
6,243,680 B1 June 2001 Gupta et al.

US Patent Application Documents

2012/0116766 A1 May 2012 WASSERBLAT et al.

FIELD OF THE INVENTION

The present invention relates generally to voice activated aircraft cockpit management systems, and more particularly to the automation of lexicon generation and data entry procedures in which formatted files are used in a voice recognition process as part of voice activated cockpit operation procedures, that is, the operation and control of aircraft systems by voice, in single and multi-engine small, large and commercial-size aircraft utilizing a voice recognition system.

More particularly, the present invention relates to a system, method and program for acquiring a character string and the like that should be newly recognized as a word. More specifically, the present invention relates to a system, a method, and a program for acquiring, for speech processing, a character string set and relaying a pronunciation that should be recognized as a word.

BACKGROUND OF THE INVENTION

In a large vocabulary continuous speech recognition system, highly accurate speech recognition requires a word dictionary in which words and phrases included in the speech are recorded and a language model by which a data score of each word or phrase may be derived, such as an appearance score, an accuracy score and other data scores. Due to the limitations of both the capacity of current storage devices for storing a dictionary and CPU performance for calculating data score values, it is desirable that these word dictionaries and this language model be minimized.

Moreover, enormous amounts of time, effort, and expense are required for manual construction of a dictionary containing even a minimum number of words and phrases. More specifically, when a dictionary is constructed from text, it is necessary first to analyze the segmentation of words, and then to assign a correct pronunciation to each of the segmented words. Since a pronunciation is information on how a word is read, expressed with phonetic symbols and the like, expert linguistic knowledge is in many cases necessary in order to assign it. Such work and expense can be a problem because information that has already been accumulated, such as a general dictionary, may not be useful.

Conventional studies have been made of techniques for automatically detecting, to some extent, character strings in a text that should be recognized as words. The present invention relates to a system, method and program that will automatically generate a character string that is newly recognized as pronunciation of a word. Prior art techniques used to date either merely support manual detection work or require time-intensive manual correction work, since the detected character strings contain many unnecessary words even though the character strings and the pronunciations may be only partially detected.

Voice recognition systems as an alternative for man-machine-interfaces are becoming more and more widely used. However, in aircraft flight environment conditions they have found limited use due to the unique challenges presented by elevated noise levels, unique grammar rules, unique vocabulary, and/or hardware limitations all associated with the cockpit environment. Meanwhile, command recognitions or selections from address book entries in mobile devices, such as mobile phones, are standard functions. In automobiles, speech recognition systems are applied to record, e.g. a starting point and an end point in a navigation or GPS system.

Voice recognition algorithms rely upon grammar and semantics to determine the best possible text match(es) to the uttered phrase(s). Conventionally they are based on hidden Markov models, which enable recognition but require high computing time. Since embedded systems with limited computing and storage resources are often employed as computing entities, the application of voice recognition to the cockpit environment has been further limited to date, and simplified speech recognition has resulted. Constraining the search space and saving resources comes with less reliable speech recognition and/or less comfortable handling for the user, in addition to the specific limitations imposed by the cockpit environment.

The aircraft operating environment is unique in the grammar rules that are followed and the vocabulary that is used. The grammar suite is rather extensive, including “words” that represent unusual collections of characters (e.g. intersection or fix names). The same goes for the vocabulary, with specific code “words” that engender particular sequences of actions in the cockpit that are known only to professionally trained pilots and are not available through use of colloquial language. Elongation of the expression to be recognized within colloquial language, even without the complexity of the pilotage grammar and vocabulary, will lead to extremely high requirements in memory and computing power. These factors make it difficult to develop a comprehensive grammar and vocabulary set for use on an aircraft, and this has represented one of several significant challenges to bringing voice recognition to the cockpit. The noise level in the cockpit in flight conditions can rise to 6-7 times the general room noise level found on the ground, which adds to the complexity of the task since specialized hardware and additional technology that enable voice recognition are required.

Others have attempted to use dynamic grammar for enhancing voice recognition systems. For example, U.S. Pat. No. 6,125,341, entitled “Speech Recognition System and Method,” issued to H. F. Raud et al, discloses a speech recognition system having multiple recognition vocabularies, and a method of selecting an optimal working vocabulary used by the system. Each vocabulary is particularly suited for recognizing speech in a particular language, or with a particular accent or dialect. The system prompts a speaker for an initial spoken response; receives the initial spoken response; and, compares the response to sets of possible responses in an initial speech recognition vocabulary to determine a response best matched in the initial vocabulary. A working speech recognition vocabulary is selected from a plurality of speech recognition vocabularies, based on the best matched response.

U.S. Pat. No. 6,745,165, entitled “Method and Apparatus For Recognizing From Here To Here Voice Command Structures in a Finite Grammar Speech Recognition System,” issued to J. R. Lewis et al, discloses a method and system that uses a finite state command grammar coordinated with application scripting to recognize voice command structures for performing an event from an initial location to a new location. The method involves a series of steps, including: recognizing an enabling voice command specifying the event to be performed from the initial location; determining a functional expression for the enabling voice command defined by one or more actions and objects; storing the action and object in a memory location; receiving input specifying the new location; recognizing an activating voice command for performing the event up to the new location; retrieving the stored action and object from the memory location; and performing the event from the initial location to the new location according to the retrieved action and object. Preferably, the enabling-activating command is phrased as “from here . . . to here.” The user specifies the new location with voice commands issued subsequent to the enabling command. To reduce the occurrence of unintended events, these voice commands are counted so that if they exceed a predetermined limit, the action and object content is cleared from memory.

U.S. Pat. No. 7,010,490, entitled “Method, System, and Apparatus for Limiting Available Selections in a Speech Recognition System,” issued to L. A. Brocious et al, discloses a method and system for completing user input in a speech recognition system. The method can include a series of steps which can include receiving a user input. The user input can specify an attribute of a selection. The method can include comparing the user input with a set of selections in the speech recognition system. Also, the method can include limiting the set of selections to an available set of selections which can correspond to the received user input. The step of matching a received user spoken utterance with the selection in the available set of selections also can be included.

Generally, any variation in the grammar implemented in a voice recognition system is based upon previous commands or states computed within the voice recognition system. Such types of systems would have limited applicability in an avionics environment because the grammar in cockpit management systems is very fragmented for specific cockpit procedural functions.

The current method for automated lexicon generation for use in voice recognition provides a dictionary that allows the voice commands for these cockpit procedural functions to be recognized by pronunciation, or phonemic syntax of allophones, and not by translating specific text words that engender display of the procedures available to date in operational cockpits in hard copy or on visual display.

SUMMARY OF THE INVENTION

A system, method and program are provided for generating from an input text a character string set which represents the pronunciation thereof, which should be recognized as a word. The system includes a candidate generation unit for generating from an input text at least one candidate character string which is a candidate to be recognized as a word; a pronunciation generating unit for generating at least one pronunciation candidate for each of the selected candidate character strings by optimizing among the pronunciations of all characters contained in the selected candidate character string, while one or more pronunciations are predetermined for each character; a phonemic dictionary unit for generating phonemic data by combining, with language model data, data in which the generated pronunciation candidates are respectively associated with the character strings; a speech recognizing unit for performing, based on the recognition of individual phonemic characters or allophones and a language model, speech recognition on the input speech to generate recognition data in which phonemic character strings respectively indicating plural words contained in the input speech are associated with pronunciations; and an outputting unit for outputting a combination contained in the recognition data, out of combinations each consisting of one of the candidate character strings and one of the candidates of a pronunciation thereof. Additionally, a program for enabling an information processing apparatus to function as the system is provided.

BRIEF DESCRIPTION OF THE DRAWINGS

For a complete understanding of the present invention and the advantages thereof, reference is now made to the following description taken in conjunction with the accompanying drawings.

FIG. 1 illustrates an example of a process in which a set of a character string and the pronunciation thereof which should be recognized as a word are newly acquired and the configuration of a word acquisition system 100 and an entirety of a periphery thereof according to the present invention;

FIG. 2 illustrates a process 200 in which the word acquisition system 100 (FIG. 1) selects and outputs a character string that should be recognized as a word with optimum phonemic recognition; and

FIG. 3 shows a flow of processing in which the command acquisition system automatically selects and outputs an audio string that should be recognized as a procedure.

DETAILED DESCRIPTION OF THE INVENTION

Although the present invention will be described below by way of an embodiment of the invention, the following embodiment does not limit the invention according to the scope of claims, and not all of the combinations of characteristics described in the embodiment are essential for the solving means of the invention. Turning now to FIG. 1, there is shown an example of a process for newly acquiring a set of a character string and the pronunciation thereof which should be recognized as a word. This first example is one in which a speech recognition system is used for acquisition of the character string and the pronunciation.

First, when a text is received, the system 100, which supports the acquisition of the character string, generates plural candidates for pronunciations of a character string. Next, a speech recognition system compares each of these pronunciation candidates with the input speech acquired from a user. As a result, the candidate whose pronunciation is most similar to the input speech is selected and outputted in association with the character string. By using the speech recognition system in this manner, a character string of a new word not registered in the phonemic dictionary of the speech recognition system can be acquired in association with a pronunciation thereof.
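By way of illustration only, the comparison of pronunciation candidates with the input speech may be sketched as follows. The sketch assumes that the input speech has already been decoded into a phoneme sequence, and it uses a generic sequence-similarity ratio from the Python standard library as a stand-in for the acoustic comparison; the function name closest_pronunciation, the phoneme symbols and the sample data are hypothetical and do not appear in the embodiment.

```python
from difflib import SequenceMatcher


def closest_pronunciation(candidates, heard):
    """Return the candidate phoneme sequence most similar to `heard`."""
    def similarity(candidate):
        # ratio() is 1.0 for identical sequences and lower otherwise.
        return SequenceMatcher(None, candidate, heard).ratio()
    return max(candidates, key=similarity)


# Hypothetical pronunciation candidates for one character string.
candidates = [
    ("K", "AE", "T"),
    ("K", "EY", "T"),
    ("S", "AE", "T"),
]
# Hypothetical phoneme sequence decoded from the input speech.
heard = ("K", "AE", "D")
best = closest_pronunciation(candidates, heard)
```

A practical recognizer would compare acoustic likelihoods rather than symbol sequences; the symbol-level ratio is used here only to keep the sketch self-contained.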

As described above, use of such processing results in a new word being acquired. However, a large amount of work and time is required if misrecognized words are numerous during construction of a dictionary for a specific field of expertise. FIG. 1 shows the configuration of the word acquisition system 100 and an entire periphery thereof according to this embodiment. A speech and a text are inputted to the word acquisition system 100. The text and the speech share the content of a common event in a predetermined field. As for the predetermined field, it is desirable to select one of the fields expected to contain certain words that are to be registered in the dictionary for speech recognition used in voice activation of cockpit management systems. Hereinafter, the text and the speech which have been inputted will be referred to as the input text and the input speech.

The word acquisition system 100 selects from the input text at least one candidate character string which is a candidate to be recognized as a word. The word acquisition system 100 then generates a plurality of candidates for the pronunciation of each selected candidate character string. Data thus generated will be referred to as candidate data. On the other hand, the voice recognition system calculates a confidence score at which each candidate character string appears in the input text. Herein, data obtained by calculating confidence scores will be referred to as language model data 110. The language model data 110 may be a numerical value calculated for each candidate character string. Instead of or in addition to this, the language model data 110 may be a numerical value calculated for each set of plural consecutive candidate character strings.
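By way of illustration only, language model data of the two kinds just described (a value per candidate character string, and a value per set of consecutive candidate character strings) may be sketched as follows. The sketch assumes that the score is a simple relative frequency over a whitespace-tokenized input text, which the embodiment does not specify; the function name language_model_data and the sample text are hypothetical.

```python
from collections import Counter


def language_model_data(tokens, candidates):
    """Score each candidate string and each consecutive candidate pair."""
    # Count single candidate strings appearing in the input text.
    unigram = Counter(t for t in tokens if t in candidates)
    # Count pairs of candidate strings that appear consecutively.
    bigram = Counter(
        (a, b) for a, b in zip(tokens, tokens[1:])
        if a in candidates and b in candidates
    )
    total = len(tokens)
    pairs = max(total - 1, 1)  # avoid division by zero on short texts
    unigram_score = {w: c / total for w, c in unigram.items()}
    bigram_score = {p: c / pairs for p, c in bigram.items()}
    return unigram_score, bigram_score


# Hypothetical input text and candidate character strings.
tokens = "direct to fix then direct to airport".split()
candidates = {"direct", "to", "fix"}
uni, bi = language_model_data(tokens, candidates)
```

Here uni maps each candidate string to its share of the tokens, and bi maps each consecutive candidate pair to its share of the adjacent token pairs.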

Next, the word acquisition system 100 combines the language model data 110 with the candidate data 102 and generates optimization score data, each piece of which indicates the optimum recognition accuracy of a set of a character string indicating a word and the pronunciation thereof 104. From the sets of candidate character strings and pronunciation candidates generated as candidate data 104, the word acquisition system 100 selects a set of a character string and a pronunciation that has been obtained in the course of processing of speech recognition. The word acquisition system 100 then outputs the selected set to a speech processing apparatus 130. That is, what is outputted is a word whose pronunciation appears in the input speech, and whose corresponding character string appears at a high confidence score in the input text. In a case where the speech recognition system employs an n-gram model, what is taken into consideration is not only the confidence score of an individual word but also the confidence scores of the preceding and succeeding words in the context 106.
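By way of illustration only, the selection of sets that were obtained in the course of speech recognition may be sketched as a filter of the candidate sets against the recognition data. The sketch assumes that both are available as (character string, pronunciation) pairs; the function name select_output and the sample pairs are hypothetical.

```python
def select_output(candidate_pairs, recognition_data):
    """Keep only candidate pairs that also occur in the recognition data."""
    recognized = set(recognition_data)
    return [pair for pair in candidate_pairs if pair in recognized]


# Hypothetical candidate data: (character string, pronunciation) pairs.
candidate_pairs = [
    ("flaps", ("F", "L", "AE", "P", "S")),
    ("flaps", ("F", "L", "EY", "P", "S")),
    ("gear", ("G", "IH", "R")),
]
# Hypothetical recognition data produced from the input speech.
recognition_data = [
    ("flaps", ("F", "L", "AE", "P", "S")),
    ("down", ("D", "AW", "N")),
]
selected = select_output(candidate_pairs, recognition_data)
```

Only the pair that appears both as a candidate and in the recognition data is retained, so a word that is in the text but never spoken, or spoken but never a candidate, is not output.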

The words thus outputted may be registered in a dictionary memorizing unit 132 and be used by the speech recognition apparatus as a dictionary for speech processing in a field corresponding to the input speech and input text. For example, by using the dictionary memorizing unit, the speech recognition apparatus 130 recognizes the input speech and outputs actuation of functions indicating the result of the recognition of the voice commands.

Referring now to FIG. 2, there is shown a process flow 200 in which the word acquisition system 100 selects and outputs a character string that should be recognized as a word. First, the candidate selecting unit 210 selects candidate character strings from the input text. So as to enhance the efficiency of subsequent processing, it is desirable that the candidate character strings be limited to those likely to be recognized as words. Next, with respect to each of the selected candidate strings, the pronunciation generating unit 220 generates at least one pronunciation candidate. The pronunciation candidate may be generated based on the pronunciation dictionary described above or may be generated by use of a technique called allophone n-gram. The allophone n-gram technique utilizes the confidence score at which each character and its pronunciation appear in a training speech and text which indicate the same contents as each other.
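By way of illustration only, one way of generating pronunciation candidates is to combine the predetermined pronunciations of the individual characters of a candidate string, as recited in the Summary. The sketch below enumerates every combination; the per-character table CHAR_PRONUNCIATIONS, the phoneme symbols and the function name pronunciation_candidates are hypothetical, and the sketch does not implement the allophone n-gram technique.

```python
from itertools import product

# Hypothetical table: one or more predetermined pronunciations per character.
CHAR_PRONUNCIATIONS = {
    "a": ["AE", "EY"],
    "c": ["K", "S"],
    "t": ["T"],
}


def pronunciation_candidates(string, table=CHAR_PRONUNCIATIONS):
    """Enumerate every combination of per-character pronunciations."""
    options = [table[ch] for ch in string]
    # product() yields one tuple per combination, in table order.
    return list(product(*options))


cands = pronunciation_candidates("cat")
```

Because the number of combinations grows multiplicatively with string length, a practical system would presumably prune or rank the combinations rather than enumerate them all.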

The confidence score generating unit 224 then performs the following processing in order to generate confidence score data. First, the confidence score generating unit 224 generates the language model data 222 based on the input text. More particularly, the confidence score generating unit 224 first finds the score at which each of the character strings contained in the input text appears in the input text and/or the confidence score at which each of the character strings and other character strings consecutively appear in the input text. Then the confidence score generating unit 224 generates the language model data 222 by calculating, based on the confidence scores, the optimum accuracy at which each of the candidate character strings appears.

Next, the confidence score generating unit 224 generates the accuracy score data by combining the language model data 222 with the candidate data in which the pronunciation candidates are respectively associated with the candidate character strings. The confidence score is configured to express a score of each set of a candidate character string and the pronunciation thereof.
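By way of illustration only, the combination of the language model data with the candidate data may be sketched as follows. The embodiment does not state how the two are combined; the sketch assumes, purely for illustration, that each pronunciation candidate carries a weight and that the score of a set is the product of that weight and the language model score of the character string. The function name score_data and all sample values are hypothetical.

```python
def score_data(candidate_data, language_model):
    """Score every (character string, pronunciation) set."""
    scores = {}
    for string, pronunciations in candidate_data.items():
        # Strings absent from the language model data score zero.
        lm = language_model.get(string, 0.0)
        for pron, weight in pronunciations.items():
            # Assumed combination rule: simple product of the two values.
            scores[(string, pron)] = lm * weight
    return scores


# Hypothetical candidate data: pronunciation candidates with weights.
candidate_data = {
    "flaps": {
        ("F", "L", "AE", "P", "S"): 0.5,
        ("F", "L", "EY", "P", "S"): 0.25,
    },
    "gear": {("G", "IH", "R"): 1.0},
}
# Hypothetical language model data: one score per character string.
language_model = {"flaps": 0.5, "gear": 0.125}

scores = score_data(candidate_data, language_model)
best = max(scores, key=scores.get)
```

Any monotone combination, such as a sum of log scores, would serve equally well in this sketch; the product is chosen only for brevity.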

FIG. 3 shows a flow of processing in which the command acquisition system 300 automatically selects and outputs an audio string that should be recognized as a procedure. First, the candidate selecting unit 302 selects candidate command audio strings from the input text. So as to enhance the efficiency of subsequent processing, it is desirable that the candidate audio strings be limited to those likely to be recognized as procedures. Next, with respect to each of the selected candidate strings, the grammar pronunciation generating unit 304 generates at least one grammar pronunciation candidate 306. The grammar pronunciation candidate may be generated based on the pronunciation dictionary as has been described above, or may be generated by use of a technique called allophone n-gram. The allophone n-gram technique utilizes the confidence score at which each character and its pronunciation appear in a training speech and text which indicate the same contents as each other.

Then the vocabulary generating unit 308 performs the following processing in order to generate the vocabulary file data 312. In the first place, the vocabulary generating unit 308 generates the procedure language model data 310 based on the input text. More specifically, the vocabulary generating unit 308 first finds the score at which each of the audio string steps contained in the input text appears in the input text, and/or the confidence score at which each of the audio string steps and other command audio strings consecutively appear in the input text. Then the vocabulary generating unit 308 generates the procedure language model data 310 by calculating, based on the confidence scores, the optimum accuracy at which each of the candidate audio strings appears. Next, the vocabulary generating unit 308 generates the accuracy score data by combining the procedure language model data 310 with the candidate data in which the pronunciation candidates are respectively associated with the candidate audio strings. The confidence score is configured to express a score of each set of a candidate audio string step and the pronunciation thereof.

From the sets of candidate audio strings and pronunciation candidates generated as candidate vocabulary data 308 and the candidate grammar data 306, the procedure acquisition system 300 selects a set of audio string steps and pronunciations that has been obtained in the course of processing of speech recognition. The procedure acquisition system 300 then outputs the selected set, as the audio vocabulary file 312 and the audio grammar file 314, to a speech processing apparatus 130. That is, what is outputted is a procedure 316 whose pronunciation appears in the input speech, and whose corresponding audio string steps appear at a high confidence score in the input text. In a case where the speech recognition system employs an n-gram model, what is taken into consideration is not only the confidence score of an individual step but also the confidence scores of the preceding and succeeding steps in the context.

The procedures thus outputted may be registered in a computer memorizing unit, and be used by the speech recognition apparatus 130 as a procedure library for speech processing in a field corresponding to the input speech of cockpit audio operational procedures. For example, by using the dictionary memorizing unit, the speech recognition apparatus 130 recognizes the input speech and outputs actuation of functions indicating the result of the recognition of the voice commands.

What is claimed is:
 1. An automation method of acquiring, from an input text and an input speech, a set of allophone character string and a pronunciation thereof which should be recognized as a word, a word in a sentence, and a word in a procedure, the automation method comprising operating one or more processors executing stored program instructions automatically to: select, from the input text, at least one allophone candidate character string which is a candidate to be recognized as a word; generate at least one pronunciation candidate of each of the selected allophone candidate character strings by combining predetermined pronunciations of all allophone characters contained in the selected allophone candidate character string, while one or more pronunciations are predetermined for each of the allophone characters; generate score data by combining data in which the generated pronunciation candidates are respectively associated with the allophone character strings, with language model data prepared by previously recording numerical values based on scores at which the respective words appear in the text and speech; the score data indicating appearance accuracy of the respective sets each consisting of an allophone character string indicating a word, a word in a sentence, a sentence in a procedure, and the pronunciation thereof; based on the generated score data, perform speech recognition on the input speech to generate recognition data in which allophone character strings respectively indicating plural words contained in the input speech are associated with pronunciations; and select and output a combination contained in the recognition data, out of combinations each comprising one of the allophone candidate character strings and one of the pronunciation candidates.
 2. A computer program product embodied in computer readable memory for enabling an information processing apparatus to function as a system for acquiring, from an input text and input speech, a set of allophone character strings and the pronunciation thereof which should be recognized as a word, a word in a sentence, and a sentence in a procedure, the computer program product comprising stored program instructions which, when executed by one or more processors, enable the information processing apparatus to function as: a candidate selecting unit for selecting, from the input text, at least one allophone candidate character string which is a candidate to be recognized as a word, a word in a sentence, and a sentence in a procedure; a pronunciation generating unit for generating at least one pronunciation candidate of each of the selected allophone candidate character strings by combining pronunciations of all allophone characters contained in the selected allophone candidate character strings, while one or more pronunciations are predetermined for each of the allophone characters; a score generating unit for generating confidence score data by combining data in which the generated pronunciation candidates are respectively associated with the allophone character strings, with language model data prepared by previously recording numerical values based on accuracy with which respective words appear in the text, the accuracy data indicating the appearance accuracy of respective sets each consisting of an allophone character string indicating a word, a word in a sentence, and a sentence in a procedure, and the pronunciation thereof; a speech recognizing unit for performing, based on the generated confidence data, speech recognition on the input speech to generate recognition data in which allophone character strings respectively indicating plural words contained in the input speech are associated with pronunciations; and an outputting unit for selecting and outputting a combination contained in the recognition data, out of combinations each consisting of one of the allophone candidate character strings and one of the candidates of a pronunciation thereof.
 3. A method for acquiring, from an input text and an input speech, a set of an allophone character string and a pronunciation thereof which should be recognized as a word, a word in a sentence, and a sentence in a procedure, the method comprising: an allophone candidate selecting unit wherein the allophone selecting unit repeats processing of adding other allophone characters to a certain allophone character string containing an input text character by character at the front-end or the tail-end of the certain character string until an optimization score in the input text of an allophone character string obtained by such addition is reached, and selects the allophone character string before the addition as the allophone candidate character string, and; an allophone candidate selecting unit comprising one or more processors executing stored program instructions for selecting from the input text, at least one allophone candidate character string which is a candidate to be recognized as a word, a word in a sentence, and a sentence in a procedure; a pronunciation generating unit comprising one or more processors executing stored program instructions for generating at least one pronunciation candidate of each of the selected allophone candidate character strings on the basis of respective allophone characters contained in the selected allophone candidate character strings; and a word acquiring unit comprising one or more processors executing stored program instructions for selecting and outputting one of the generated allophone candidate character strings and corresponding one of the pronunciation candidates, on conditions that the selected pronunciation candidate is contained in the input text, and that two contexts in the input speech are similar to each other to an extent not less than a predetermined criterion, one of the contexts having the selected pronunciation candidate appear, and the other of the contexts having the selected allophone candidate character string appear.